Weighted Random Sampling over Data Streams
Abstract
In this work, we present a comprehensive treatment of weighted random sampling (WRS) over data streams. More precisely, we examine two natural interpretations of the item weights, describe an existing algorithm for each case ([2,4]), discuss sampling with and without replacement, and show adaptations of the algorithms for several WRS problems and evolving data streams.
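As a rough illustration of one-pass WRS without replacement, the sketch below follows the well-known key-based reservoir idea attributed to Efraimidis and Spirakis: each item draws a uniform random number u, receives the key u^(1/w) for its weight w, and the m items with the largest keys are kept. It is not necessarily the exact algorithm cited as [2,4]. The function name `weighted_reservoir_sample`, its parameters, and the rule of skipping non-positive weights are assumptions made for illustration only.

```python
import heapq
import random


def weighted_reservoir_sample(stream, m, rng=None):
    """Sample up to m items without replacement from (item, weight) pairs.

    Hypothetical helper: one pass over the stream, O(m) memory.
    """
    rng = rng or random.Random()
    heap = []  # min-heap of (key, arrival index, item); smallest key on top

    for index, (item, weight) in enumerate(stream):
        if weight <= 0:
            continue  # assumption: non-positive weights can never be selected

        u = 1.0 - rng.random()      # uniform in (0, 1], avoids u == 0
        key = u ** (1.0 / weight)   # larger weights push the key toward 1
        entry = (key, index, item)  # index breaks ties without comparing items

        if len(heap) < m:
            heapq.heappush(heap, entry)
        elif key > heap[0][0]:
            # The new key beats the smallest key in the reservoir: swap it in.
            heapq.heapreplace(heap, entry)

    return [item for _, _, item in heap]


if __name__ == "__main__":
    data = [("a", 5.0), ("b", 1.0), ("c", 1.0), ("d", 3.0)]
    print(weighted_reservoir_sample(data, 2, random.Random(42)))
```

The min-heap keeps the smallest retained key on top, so each arriving item costs one comparison and, only when it enters the reservoir, an O(log m) heap update.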
Related works
Weighted Random Sampling (2005; Efraimidis, Spirakis)
The problem of random sampling without replacement (RS) calls for the selection of m distinct random items out of a population of size n. If all items have the same probability to be selected, the problem is known as uniform RS. Uniform random sampling in one pass is discussed in [1, 5, 10]. Reservoir-type uniform sampling algorithms over data streams are discussed in [11]. A parallel uniform r...
Weighted Sampling Without Replacement from Data Streams
Weighted sampling without replacement has proved to be a very important tool in designing new algorithms. Efraimidis and Spirakis (IPL 2006) presented an algorithm for weighted sampling without replacement from data streams. Their algorithm works under the assumption of precise computations over the interval [0, 1]. Cohen and Kaplan (VLDB 2008) used similar methods for their bottom-k sketches. ...
Stratified Reservoir Sampling over Heterogeneous Data Streams
Reservoir sampling is a well-known technique for random sampling over data streams. In many streaming applications, however, an input stream may be naturally heterogeneous, i.e., composed of substreams whose statistical properties may also vary considerably. For this class of applications, the conventional reservoir sampling technique does not guarantee a statistically sufficient number of tupl...
Approximate Integration of streaming data
We approximate analytic queries on streaming data using weighted reservoir sampling. For a stream of tuples from a data warehouse, we show how to approximate some OLAP queries. For a stream of graph edges from a social network, we approximate the communities as the large connected components of the edges in the reservoir. We show that for a model of random graphs which follow a power law degree di...
Sketch-Based Estimation of Subpopulation-Weight
Summaries of massive data sets support approximate query processing over the original data. A basic aggregate over a set of records is the weight of subpopulations specified as a predicate over records’ attributes. Bottom-k sketches are a powerful summarization format of weighted items that includes priority sampling [18] (pri) and the classic weighted sampling without replacement (ws). They ca...